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ON TESTING THE SIGNIFICANCE OF SETS OF GENES 

By Bradley Efron 1 and Robert Tibshirani 2 

Stanford University 

This paper discusses the problem of identifying differentially ex- 
pressed groups of genes from a microarray experiment. The groups of 
genes are externally defined, for example, sets of gene pathways de- 
rived from biological databases. Our starting point is the interesting 
Gene Set Enrichment Analysis (GSEA) procedure of Subramanian et 
al. [Proc. Natl. Acad. Set. USA 102 (2005) 15545-15550]. We study 
the problem in some generality and propose two potential improve- 
ments to GSEA: the maxmean statistic for summarizing gene-sets, 
and restandardization for more accurate inferences. We discuss a va- 
riety of examples and extensions, including the use of gene-set scores 
for class predictions. We also describe a new R language package GSA 
that implements our ideas. 

1. Introduction. We discuss the problem of identifying differentially ex- 
pressed groups of genes from a set of microarray experiments. In the usual 
situation we have N genes measured on n microarrays, under two differ- 
ent experimental conditions, such as control and treatment. The number of 
genes iV is usually large, say, at least a few thousand, while the number 
samples n is smaller, say, a hundred or fewer. This problem is an example 
of multiple hypothesis testing with a large number of tests, one that often 
arises in genomic and proteomic applications, and also in signal processing. 
We focus mostly on the gene expression problem, but our proposed methods 
are more widely applicable. 

Most approaches start by computing a two-sample t-statistic Zj for each 
gene. Genes having i-statistics larger than a pre-defined cutoff (in absolute 
value) are declared significant, and then the family-wise error rate or false 
discovery rate of the resulting gene list is assessed by comparing the tail area 



Received October 2006; revised January 2007. 

Supported in part by NSF Grant DMS-05-05673 and National Institutes of Health 
Contract 8R01 EB002784. 

Supported in part by NSF Grant DMS-99-71405 and National Institutes of Health 
Contract N01-HV-28183. 

Key words and phrases. Multiple testing, gene set enrichment, hypothesis testing. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics. 
2007, Vol. 1, No. 1, 107-129. This reprint differs from the original in pagination 
and typographic detail. 



1 



2 



B. EFRON AND R. TIBSHIRANI 



from a null distribution of the statistic. This null distribution is derived from 
data permutations, or from asymptotic theory. 

In an interesting and useful paper, Subramanian et al. (2005) proposed 
a method called Gene Set Enrichment Analysis (GSEA) for assessing the 
significance of pre-defined gene-sets, rather than individual genes. The gene- 
sets can be derived from different sources, for example, the sets of genes 
representing biological pathways in the cell, or sets of genes whose DNA 
sequences are close to together on the cell's chromosomes. The idea is that 
these gene-sets are closely related and, hence, will have similar expression 
patterns. By borrowing strength across the gene-set, there is potential for 
increased statistical power. In addition, in comparing study results on the 
same disease from different labs, one might get more reproducibility from 
gene-sets than from individual genes, because of biological and technical 
variability. 

The GSEA methods works roughly as follows. We begin with a pre-defined 
collection of gene-sets S\,S2, ■ ■ ■ ,Sk ■ We compute t-statistic Zj for all N 
genes in our data. Let z& = (z±, Z2, ■ ■ ■ , z m ) be the gene scores for the m genes 
in gene-set S^. In GSEA we then compute a gene-set score Sk(zk) for each 
gene-set Sk , equal to essentially a signed version of the Kolmogorov-Smirnov 
statistic between the values {Zj, j £ Sk} and their complement {zj,j ^ Sk}; 
the sign taken positive or negative depending on the direction of shift. The 
idea is that if some or all of the gene-set Sk have higher (or lower) values 
of zj than expected, their summary score Sk should be large. An absolute 
cutoff value is defined, and values of Sk above (or below) the cutoff are 
declared significant. The GSEA method then does many permutations of 
the sample labels and recomputes the statistic on each permuted dataset. 
This information is used to estimate the false discovery rate of the list of 
significant gene-sets. The Bioconductor package limma offers an analysis 
option similar to GSEA, but uses instead the simple average of the scores 
Zk [see Smyth (2004)], randomizing over the set of genes rather than the set 
of samples, what we call "row randomization" here. 

Other related ideas may be found in Pavlidis et al. (2002) and Rahnenfhrer et al. 
(2004). Nobel and Wright (2005) propose the "SAFE" methodology a quite 
general permutation-based approach to the enrichment testing problem. 
Newton et al. (2006) propose random set scoring methods for assessing the 
significance of gene-set enrichment. Zahn et al. (2006) propose an alternative 
to GSEA that uses a Van der Waerden statistic in place of the Kolmogorov- 
Smirnov statistic and bootstrap sampling of the arrays instead of a permu- 
tation distribution. Other papers that address the problem of testing for dif- 
ferentially expressed sets of genes include Szabo et al. (2003), Frisina et al. 
(2004), Lu et al. (2005) and Dettling et al. (2005). One of our goals here 
is to make explicit the choices involved between the various randomization 
schemes. 
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In studying the GSEA work, we have found some shortcomings and ways 
it could be improved. The GSEA's dependence on Kolmogorov-Smirnov 
statistics is a reasonable choice, but not a necessary one. This paper puts 
the GSEA procedure in a more theoretical framework that allows us to inves- 
tigate questions of efficiency for gene-set inference; a new procedure based on 
the "maxmean" statistic is suggested that has superior power characteristics 
versus familiar location/scale alternatives. 

Here are two simulated data examples that illustrate some of the main 
issues, and allow us to introduce our proposed solution. We generated data 
on 1000 genes and 50 samples, with each consecutive nonoverlapping block 
of 20 genes considered to be a gene-set. The first 25 samples are the control 
group, and the second 25 samples are the treatment group. First we gen- 
erated each data value as i.i.d. iV(0,l). Then the constant 2.5 was added 
to the first 10 genes in the treatment group. Thus, half of the first gene-set 
(first block of 20 genes) has a higher average expression in the treatment 
group, while all other gene-sets have no average difference in the two groups. 

The left panel of Figure 1 shows a histogram (black lines) of the GSEA 
scores for the 50 gene-sets. The first gene-set clearly stands out, with a 
value of about 0.9. We did 200 permutations of the control-treatment labels, 
producing the dashed histogram in the top left panel of Figure 1. The first 
gene-set stands out on the right side of the histogram. So the GSEA method 
has performed reasonably well in this example. 

In the paper we study alternative summary statistics for gene-sets. Our 
favorite is something we call the "maxmean statistic": we compute the av- 
erage of the positive parts of each Z{ in S, and also the negative parts, and 
choose the one that is larger in absolute value. The results for maxmean 
in this example are shown in the right panel of Figure 1. The first gene-set 
stands out more clearly than it does in the left panel. In this paper we show 
by both analytic calculations and simulations that the maxmean statistics 
are generally more powerful than GSEA. 

Now consider a different problem. We generated data exactly as before, 
except that the first 10 genes in every gene-set are 2.5 units higher in the 
treatment group. The top left panel of Figure 2 shows a histogram of the 
maxmean scores for the 50 gene-sets, and a histogram of the scores from 200 
permutations of the sample labels (dashed). All of the scores look signifi- 
cantly large compared to the permutation values. But given the way that 
the data were generated, there seems to be nothing special about any one 
gene-set. To quantify this, we "row randomized" the 1000 genes, leaving the 
sample labels as is. The first 20 genes in the scrambled set became the first 
gene-set, the second 20 genes became the second gene-set, and so on. We did 
this many 200 times, recomputing the maxmean statistic on each scrambled 
set. The results are shown in the bottom left panel of Figure 2. None of 
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Fig. 1. Example 1: left panel shows histograms of GSEA scores for original data (black) 
and from 200 permutations of the sample labels (dashed). The first gene-set stands our 
fairly clearly with a score of about 0.7. The right panel shows the results using instead 
the u maxmean" statistic. The first gene-set stands out much more clearly than in the left 
panel. 



the original 50 gene-sets are notable when compared to gene-sets formed by 
randomly sampling from the full set of genes. 

The point is that any method for assessing gene-sets should compare a 
given gene-set score not only to scores from permutations of the sample la- 
bels, but also take into account scores from sets formed by random selections 
of genes. Both GSEA and SAFE do something like this, at least implicitly. 

The bottom right panel of Figure 2 puts all of these ideas together. It 
uses the restandardized version of the maxmean statistic, in which we cen- 
ter and scale the maxmean statistic by its mean and standard deviation 
under the row randomizations, like those in the bottom left panel. This 
standardized maxmean statistic is then computed both on the original data 
(light histogram) and on each of the permuted datasets (dark histogram). 
As we would expect, we see no significant gene-sets. This restandardization 
is potentially important for any gene summary statistic: it turns out that the 
GSEA statistic incorporates a form of what we are calling restandardization. 

The two ideas illustrated above — alternative summary statistics for gene- 
sets and restandardization based on row randomization — are two of the main 
proposals in this paper. In Section 2 we describe the randomization and 
permutation methods for estimating appropriate null distributions. Section 3 
studies the choice of summary statistic for gene-sets, and introduces the 
maxmean statistic. In Section 4 we summarize our proposal for Gene Set 
Analysis and discuss computational issues. A simulation study is carried out 
in Section 5, comparing the power of the maxmean statistic to the GSEA 
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Fig. 2. Example 2: top /e/J panel shows a histogram of the maxmean scores. In the top 
right panel a histogram of maxmean scores from 200 permutations of the sample labels has 
been added (dashed): all of the gene-sets look to be significant. In the bottom left panel 
a histogram of the maxmean scores is shown along with maxmean scores from 200 "row 
randomizations" — gene-sets chosen from the full collection of genes at random (dashed). 
Finally, in the bottom right panel, histograms of the restandardized scores from the original 
data (solid) and from 200 permutations of the sample labels (dashed) are shown. (Note that 
the horizontal axis has changed, since the values have been centered and scaled). None of 
the gene-sets looks significant, which is reasonable, given the way the data were generated. 



statistic and other competitors. Finally, in Section 6 we give a number of 
examples of the method, including applications to a kidney cancer dataset, 
a generalization to the class prediction problem and comparison of different 
gene-set collections over the same expression dataset. 
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2. Randomization and permutation. A straightforward approach to gene- 
set analysis begins with some enrichment score S and computes its signif- 
icance by comparison with permutation values S*. Here we argue that a 
second kind of comparison operation, "row randomization," is also needed 
to avoid bias in the determination of significance. We begin with a simplified 
statement of the gene-set problem, leading later to a more realistic analysis. 

Let X indicate an N by n matrix of expression values, N genes and n 
microarrays, with the first n\ columns of X representing Class 1 and the last 
ri2 Class 2, n\ + ni = n. In the p53 example of Subramanian et al. (2005) 
there are N = 10100 genes and n = 50 arrays relating to cell lines with 
normal or mutated states for the p53 factor, n\ = 17 normal and ri2 = 33 
mutated. 

The ith row of X , that is, the data for gene i, yields a two-sample t- 
statistic "V comparing the classes. In this section it will be convenient 
to transform the U values to z-values il Zi" theoretically having a standard 
normal distribution, 



under the null hypothesis of no difference between the two treatments. [The 
transformation is Z{ = <I> _1 (i ? n _2(ti)), where $ is the standard normal cumu- 
lative distribution function (c.d.f.) while -F n _2 is the c.d.f. for a t distribution 
having n — 2 degrees of freedom.] The methods here apply outside the realm 
of genes, microarrays and t-tests, but that is the main application we have 
in mind. 

To begin with, suppose that a single gene-set "5," comprising m genes, 
is under consideration, and we wish to test the hypothesis that S is en- 
riched, meaning that its m z-values are unusually large, positively or neg- 
atively, in a sense to be defined. The scientific idea underlying enrichment, 
as nicely stated in Subramanian et al. (2005), is that a biologically related 
set of genes, perhaps representing a genetic pathway, could reveal its impor- 
tance through a general effect on its constituent z-values, whether or not 
the individual Zi's were significantly nonzero. 

Let zs indicate the set of m z-values in S and 



define an enrichment test statistic, with larger value of S indicating greater 
enrichment. The GSEA algorithm uses a Kolmogorov-Smirnov type function 
for S(zs). A simpler approach starts with a function "s" of the individual 
z- values, 





(2.2) 



S = S(z s ) 





s 



ON TESTING THE SIGNIFICANCE OF SETS OF GENES 



7 



The choice s(z) = \z\ will be discussed later. Limma [Smyth (2004)], a mi- 
croarray analysis program available in the Bioconductor R package, im- 
plements (2.4) with s(z) = z, so S is simply z$, the average z-score in S 
(actually using S = \zg\ for two-sided inference) . Section 3 develops a third 
choice, the "maxmean" statistic. 

Testing S for enrichment requires a null hypothesis distribution for S, 
and that is where difficulties begin; there are two quite different models for 
what "null" might mean. We discuss these next, in a spirit closely related 
to "Ql" and "Q2" of Tian et al. (2005). 

Randomization Model. The null hypothesis -f^rand is that S has been 
chosen by random selection of m genes from the full set of N genes. In this 
case the null density of S, say, grand(S'), can be obtained by row randomiza- 
tion: sets of m rows of the expression matrix X are drawn at random, 
giving randomized versions of (2.4), a large number of which are gener- 
ated and used to estimate g ra . n d(S)- Equivalently, random subsets z^ of size 
m are drawn from all iV z^s, giving = S(z§). 

Permutation Model. Let X$ be the m by n submatrix of X correspond- 
ing to S. The null hypothesis H pcrm is that the n columns of X$ are in- 
dependent and identically distributed m-vectors (i.i.d.). The null density of 
S, g P crm(S), is obtained by column permutations, leading to an estimate of 
9perm(S) in the usual way. 

The Randomization Model has the appealing feature of operating condi- 
tionally on the set z of all N z-values: given z, it tests the null hypothesis 
that the observed S is no bigger than what might be obtained by a random 
selection of S. 

Its defect concerns correlation between the genes. Suppose the scores Sj 
in (2.3) have empirical mean and standard deviation 

(2.5) s ~ (mean s ,stdev s ) 

for the ensemble of all N genes. Because row randomization destroys gene- 
wide correlations, the randomized version of (2.4) will have mean and stan- 
dard deviation 

(2.6) S-t-GuV), 
with 

(2.7) fi^ = mean s and a' = stdev s /y/m. 

However, will underestimate the variability of S if there is positive cor- 
relation among the z- values in S, which seems likely if S was chosen on the 
basis of biological similarities. The p- value in the Limma package compares 
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Fig. 3. Permutation standard deviation compared to randomization standard deviation 
for S = zs, 395 gene-sets for p53 data, Subramanian et al. (2005). Permutation values 
exceed randomization values, especially for gene-sets with large m at left, reflecting positive 
correlations within gene-sets. 



S = zs with the row randomization distribution of S' , ignoring, perhaps 
dangerously, correlation effects. 

Permutation methods deal nicely with the correlation problem. Let Rs 
indicate the correlation matrix of the i.i.d. columns of X$ in the Permu- 
tation Model. Because column-wise permutations maintain the integrity of 
column vectors, it turns out, under some regularity conditions, that z s , the 
permutation vector of z- values in S, will have correlation matrix approxi- 
mating Rg, as will zg itself. In other words, for a prechosen gene-set 5, the 
permutation correlation matrix of zs will be approximately correct. 

Figure 3 concerns correlation effects in the p53 example of Subramanian et al. 
(2005); the authors provide a catalog of 522 gene sets, of which 395 have 
m > 10 members. Taking s(z) = z yields 

(2.8) (mean s , stdev,) = (0.00, 1.13), 

as in (2.5), this being the mean and standard deviation of all those Zj's 
involved in the 395 gene-sets. 400 permutation values of S* = z s , the average 
z* values in S, were generated for each of the 395 gene-sets S, and the 
permutation standard deviation a s computed. 
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FlG. 4. Enrichment statistics S for 129 randomly selected gene-sets of size m = 25, heavy 
line histogram, lies far to right of permutation S* values, solid histogram. S = ^2 S \zi\/m, 
BRCA data, Hedenfalk et al. (2001). Restandardized values (2.12), light line histogram, 
gives a much better match to actual values. 

Figure 3 plots <j* s versus = 1.13/ y/ms, the randomization standard 
deviation (2.7); <Js/ a s has median 1.08 (light line), but the ratio is consid- 
erably greater for large m, with median 1.50 for m > 50. 

The Permutation Model has a serious weakness of its own: it does not 
take into account the overall statistics (mean S) stdev s ), (2.5), as demon- 
strated next. Figure 4 concerns the "BRCA data," Efron (2004), a microar- 
ray experiment comparing two classes of breast cancer patients, N = 3226 
genes, n = 15 microarrays, n = 7 for class 1, n,2 = 8 for class 2. A two- 
sample t-statistic ti was computed for each gene and converted to z-value, 
Zi = $ _1 (Fi3(tj)). The N z-values have mean and standard deviation 

(2.9) z~ (-0.026, 1.43). 

In this case (2.1), the theoretical iV(0, 1) null, substantially underesti- 
mates the z-values' variability, their histogram looking like an overdispersed 
but somewhat short-tailed normal. 

For this example 129 gene-sets S of size m = 25 each were constructed 
by random selection from the N genes. Enrichment was tested with S = 
Y^s \ z i\l m i that is, by using (2.3) and (2.4) with s(z) = \z\. Figure 4 shows 
that the permutation distribution S* greatly underestimates the actual 129 
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S values. A standard Benjamini-Hochberg (1995) FDR analysis based on 
permutation p-values, carried out at rejection level q = 0.10, labels 113 of 
the 129 generate as "enriched," even though we know that all of them were 
constructed randomly. 

It is easy to spot the trouble here: the permutation z-values are much less 
dispersed than the actual z's, with marginal mean and standard deviation, 
for all N of them, 

(2.10) z* ~ (0.022, 1.014) 

compared to (2.9). [This is a dependable feature of the permutation t-test, 
which tends to yield z* ~ iV(0, 1), the theoretical null, under a wide variety 
of circumstances.] 

The individual scores s* = s(z*) are correspondingly reduced. Letting 
(mean*, stdev*) indicate the mean and standard deviation of s* over all 
N genes and a large number of permutations, 

(2.11) s* ~ (mean*, stdev*), 

we computed (mean*, stdev*) = (0.82,0.60) compared to (mean S) stdev, ) = 
(1.17,0.82) for the actual scores (2.5). This translates into S* statistics that 
are much smaller than the actual S values. 



2.1. Restandardization. We combine the randomization and permuta- 
tion ideas into a method that we call "Restandardization." This can be 
thought of as a method of correcting permutation values S* to take account 
of the overall distribution of the individual scores Sj. For a gene-set score 
S = J2s s ( z i)/ m as m (2-3) and (2.4), the restandardized permutation value 
is defined to be 

(2.12) #** = me an s + ^^0S*- mean*), 

stdev 

with (mean,,, stdev, ) and (mean*, stdev*) the overall means and standard 
deviations defined at (2.5) and (2.11). 

A restandardized p- value for testing the enrichment of gene-set S is ob- 
tained by comparing its actual enrichment score S to a large number u B n 
of S** simulations, 

(2.13) p s = #{S** values exceeding S}/B. 

Figure 4 shows the restandardized value S** giving much better agreement 
than S* with the actual S values, though the S** distribution is moderately 
overdispersed. 

Here is another equivalent way to express the restandardization proce- 
dure: we first standardize the statistic S with respect to its randomization 
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mean and standard deviation, and then compute the permutation distribu- 
tion of the standardized statistic. This gives the p-value 

,, f S* — mean* , 5 - mean. ] / 

< 2 - I4 » ^ = H^w" exceedmg "Sd^r }/ B ' 

It is this form of restandardization that we use in our software package GSA, 
described in Section 4. 

Restandardization can also be applied to more complicated gene-set statis- 
tics S, not simple averages as in (2.4), via 

(2.15) S** = ^ + ^{S* - fi*). 

Here (/x^,cj^) are the mean and standard deviation of for a randomly 
selected gene-set of size m, as in (2.5), while (fi*,a*) are the corresponding 
quantities based on a permuted data matrix; computing (/i*,<7*) requires a 
nested simulation. 

The GSEA enrichment test incorporates a form of restandardization: it 
compares the empirical c.d.f. of the z-values in S with the c.d.f. of all 
other z-values; the latter c.d.f. brings to bear information from the over- 
all distribution of z, much like mean s and stdev s do in (2.14). This same 
approach could be used in context (2.4) by replacing s(zi) in (2.3) with 
t(zi) = (s(zi) — mean s )/stdev s , leading to 

(2.16) T = t(zi) jm = (S — mean s )/stdev s 

s 

as the enrichment statistic for S. Then (2.14) reduces to 

(2.17) Ps = #{T* > T}/B, 

so that the restandardized p-value equals the usual permutation p-value. 

Restandardization can be shown to yield improved inferences in a variety 
of situations: 

• If S was selected randomly, as in the Randomization Model. 

• If the theoretical null (2.1), z ~ iV(0, 1), agrees with the empirical distri- 
bution of the iV z-values. 

• If the ZiS are uncorrelated. 

The method is not perfect, as examples can be constructed to show. Nei- 
ther the Randomization nor Permutation Models perfectly describe how 
gene-sets S come under consideration in the catalog examples of Subramanian et al. 
(2005), making some sort of compromise formula a necessity. 
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3. Efficient tests for enrichment. The Randomization and Permutation 
Models specify null hypotheses for the selection of a gene-set S. Notions 
of efficient testing, however, require us to specify alternatives to the null 
selection. This section begins with an exponential family selection model, 
considers some specific test statistics, and goes on to propose the "maxmean" 
statistic, which leads to enrichment tests with good power characteristics. 

The Poisson Selection Model starts with independent Poisson variates 

(3.1) h'-Poivi), v l = aeP' s */T p 

for i = 1, 2, . . . , TV, where Sj = s(zi), for s(-) a given J-dimensional function, 
(3 an unknown J-dimensional parameter vector, 

N 

(3.2) 2> = ]>X S S 

i=l 

and q = Yli v % an unknown scalar parameters. In what follows, the Vi will 
be small and the Ii almost all zeros or ones, mostly zeros. For convenient 
exposition we assume that all the 1% are zeros or ones, though this is not 
essential. 

We define the selected gene-set S as composed of those genes having 
h = l, 

(3.3) S = {i:Ii = l}, 
so S has 

N 

(3.4) m = J2li 

l 

members. It is easy to calculate the probability g a ^(S) of selecting my 
particular gene-set S, 

( c —ct m \ 
e -^y m \e m ^-^), 

where 

(3.6) s s = ^2si/m. 

s 

This is a product of two exponential family likelihoods, yielding maximum 
likelihood estimates a = m and (3 defined by 

N , N 

(3.7) E^7i;^=s s . 

i=l ' i=l 

Parameter (3 = corresponds to the Randomization Model null hypoth- 
esis that S has been selected at random. Given gene-set size m, the suffi- 
cient statistic for testing Hq:(3 = Q is ss, called U S" at (2.4). An efficient 
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test rejects Hq if ss is far from its null expectation s = J2i S i/N. If s$ is 
one-dimensional, as in Section 2, we reject for extreme values of s$, either 
positive or negative. 

Under Ho : /? = 0, all gene-sets S of size m are equally likely. Nonzero (5 
"tilts" the selection toward gene-sets having larger values of P'ss, as seen in 
(3.5) where 

(3.8) gp(S\m) = mle m[f3 " ss - logT ^ . 

Efficient choice of a gene-set enrichment test statistic S = z~s depends on 
the individual scoring function s, = s{zi) in (3.3). Consider two choices: 

(3.9) s {1) {z)=z and s {2 \z) = \z\, 

both of which are univariate and have (5 a scalar, J = 1. The first of these, 
the Limma choice, uses S = %; enriched gene-sets S will tend to have zs far 
away from z = J2i Z i/N ■ This is fine if enrichment manifests itself as a shift 
of location in ^-values. The second choice s^ 2 \z) = \z\ has power against 
scale alternatives. 

For general use one would like enrichment test statistics having power 
against both shift and scale alternatives. To this end, define the two-dimensional 
scoring function 

, 3 ,o) 

and the maxmean test statistic 

(3.11) S max = max{4 +) ,S5~ ) }- 

Smax is designed to detect unusually large z-values in either or both direc- 
tions. Note that S max is not the same as the largest absolute value of the 
mean of the positive or negative z values in the gene-set: it divides by the 
total number of genes m. If, for example, a gene-set of 100 genes had 99 
scores of —0.5 and one score of 10, the average of the positive scores would 
be 10, and the average of the negative scores equal to —0.5, so the mean with 
largest absolute value would be 10. But the maxmean statistic is the max- 
imum of 10/100 = 0.1 and — 99(— 0.5)/100 = 0.495, and the negative scores 
would win. The maxmean statistic is robust by design, not allowing a few 
large positive or negative gene scores to dominate. 

Figure 5 compares the power of 5 max with the limma test that rejects 
the null hypothesis of no enrichment for large values of the absolute mean 
statistic \zs\- Here S consists of m = 25 independent normal z- values, 

(3.12) Zi l ~N(b,g 2 ), i = 1, 2, . . . ,m = 25. 
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Contours of equal power are shown, for testing Hq : (b,g) = (0, 1) versus al- 
ternatives (6 > 0, g > 1) [or, equivalently, (b < 0, g > 1)], at level 0.95; better 
power is reflected by contours nearer the null hypothesis point (b,g) = (0, 1). 

As expected, the absolute mean test has good power for shift alternatives 
6^0, but no power for scale alternatives g > 1, as shown by its nearly 
vertical contour lines. The maxmean test shows reasonable power in both 
directions. Comparing the 0.5 power contours, for instance, shows maxmean 
almost dominating \zs\- Also shown is "KS," the 0.5 power contour for a 
test based on absolute Kolmogorov-Smirnov distance between the empirical 
c.d.f. of the 25 z-values and a N(0, 1) c.d.f. (The GSEA test is based on 
an improved variant of the Kolmogorov-Smirnov statistic.) In this case, 
maxmean dominates ks, its contour being everywhere closer to (6, g) = (0, 1) . 
Finally, we show the results for the test based on the mean absolute value: 
this has the best power for pure scale alternatives but is not very sensitive 
to location shifts. 




i 1 1 1 r 



0-0 D.2 O.fl 0.$ Q.0 10 

b 

Fig. 5. Contours of equal power for absolute mean, enrichment statistic \zs\ (jagged 
vertical line) mean of absolute value (curving broken line), and maxmean statistic (heavy 

curving lines); for testing Ho : Zi ~ iV(0, 1) vs N(b, g 2 ), i = 1, 2, . . . , m = 25. at level 0.95; 
"fcs" is 0.5 power contour for the Kolmogorov-Smirnov test, dominated by maxmean test 
with 0.5 contour everywhere closer to null point (0, 1). Using mean absolute value, average 
over S of \z\ has very little power versus shift alternatives, curved dashed line. 
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Using a similar setup, the left panel of Figure 6 compares the power of the 
maxmean statistic to that of the individual gene scores Zj. We see that the 
maxmean statistic has a strong power advantage. In the right panel we have 
changed the setup slightly, with half of the genes in the gene-set generated 
as N(b,g 2 ), and the other half as N(—b,g 2 ). The increase in power enjoyed 
by the maxmean statistic is slightly less than in the left panel, but it is still 
substantial. 

We applied the maxmean statistic to the p53 data with the catalog of 
522 gene-sets in Subramanian et al. (2005). Restandardization, (2.12), no- 
ticeably improved results, though the effect was much less dramatic than in 
Figure 4. Applying the FDR algorithm to the restandardized p- value (2.13), 
cutoff level q = 0.10, yielded 8 significant gene-sets, Table 1, in close agree- 
ment with Table 2 of Subramanian et al. (2005). 

Significance is defined in a two-sided testing sense here, but we can use 
Sg and in (3.11) separately. Gene-sets 4 and 7 in Table 1 have 
predominating, indicating positive enrichment in the mutant class, while the 
other six gene-sets had bigger z- values overall in the normal class. 

Note. Restandardization formula (2.12) was applied separately to 
and Sg "* in (3.11), and then combined to give 

(3.13) SL = max{i +)w , S t )w }. 

Figure 7 displays the top eight pathways, showing the individual gene 
scores Zi in each pathway. The inset at the top shows the estimated FDR 




Fig. 6. Contours of power 0.5 (solid lines) for individual gene scores Zi and power 

0.5, 0.7 ... 0.9 (broken lines) for the maxmean statistic; for testing Ho:zi ~ N(0, 1) vs 
N(±b,g 2 ),i = 1,2, ... ,m = 25. In the left panel, all genes in the gene-set are shifted by b; 
in the right panel, half of the genes are shifted by b and half by —b. 
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Table 1 

Significant gene-sets, FDR cutoff 0.10, for p53 data, 
Subramanian et al. (2005); maxmean statistic 
restandardized 



1. p53 hypoxia* 

2. p53 pathway* 



5. p53 Up* 

6. hsp27 pathway* 

7. n.g.f. pathway 

8. SA Gl and S phases 



3. radiation sensitivity* 

4. ras pathway* 



'Indicate significant gene-sets listed in Subramanian et al. 
(2005), Table 2. They list the n.g.f. pathway as almost sig- 
nificant. 



for an individual gene analysis. We see that the cutoff for an FDR of 0.10 
is about —4.7 for negative scores, while the FDR for positive scores never 
gets down to 0.10. Hence, an analysis of individual genes would report as 
significant the top few genes in the negative gene-sets, and none of the 
genes in the positive gene-sets. Note that location shift alternatives look 
quite reasonable here. 

The values of (mean s , stdev s ) and (mean*, stdev*) used in (2.12) were 
based on the catalog of 15,059 genes listed (with multiple occurrences) in the 
522 gene-sets, rather than the 10,100 genes in the original p53 data set. Table 
2 shows considerable differences between the two choices for (mean s , stdev s ), 
though not for (mean*, stdev*), with s = s^ + \z), the comparison for s~(z) 
being similar. When a catalog of gene-sets S is simultaneously under con- 
sideration for enrichment, the catalog choice of (mean s , stdev s ) is preferable 
from the row randomization viewpoint. 

The Poisson Selection Model (3.1) and (3.2) follows the Randomization 
Model of Section 2 in that the set z of N original z-values is treated as a 
fixed ancillary, with only the selection process for S considered random. We 
could, instead, think of z-values selected for inclusion in S as coming from 
a different underlying distribution, for example, 



where fo(z) is the baseline density for nonenriched z-values, j3 and s(z) are 
as m (3.1), and Tp = /f^e^'W f (z) dz. If the members of S are selected 
independently according to (3.14), then the most powerful test for null hy- 
pothesis = is based on ss = J2s s ( z )/ m i J us t as m (3.6). 

This point of view is more congenial to the Permutation Model of Sec- 
tion 2, but there is really not much practical difference from (3.1). Let Fq 
denote the empirical distribution 



(3.14) 




(3.15) 



F [a,b] = #{ Zl in [a,b]}/B, 
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Fig. 7. p53 data: the top eight pathways showing the individual gene scores Zi in each 
pathway. The inset at the top shows the estimated FDR for an individual gene analysis. 



Table 2 

Comparison of (mean, stdev) values for s = s^ + \z), (3.10); for original list of 10,100 
genes in p53 data, versus 15,059 genes listed with multiple occurrences, in catalog of 522 
gene-sets. Latter is preferred choice for restandardization formula (2.12) 





mean s 


(stdevs) 


mean* 


(stdev*) 


Original 10,100 genes 


0.451 


(0.628) 


0.406 


(0.589) 


Catalog 15,059 genes 


0.456 


(0.671) 


0.404 


(0.587) 
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and Fp the tilted distribution 



N 



(3.16) 



Fp[a,b\ = ^2{fj,i for Zi in [0,6]}/^/^. 



1 



Then (3.1) and (3.2) give 



(3.17) 



F [a,b] 



r 

J a 



e p ' s ^ z) dFo(z) 



just as in (3.14). 

More elaborate probability models can be constructed to make, say, the 
Kolmogorov-Smirnov distance the optimal test statistic. The strongpoint of 
the maxmean enrichment statistic is its good performance against simple 
location and scale alternatives, as seen in Figure 5. 

4. Computational issues and software. The developments in the previous 
two sections lead to our Gene Set Analysis procedure, which we summarize 
here: 

1. Compute a summary statistic Z{ for each gene, for example, the two- 
sample i-statistic for two-class data. Let z$ be the vector of Z{ values for 
genes in a gene-set S. 

2. For each gene-set S, choose a summary statistic S = s(z): choices in- 
clude the average of z% or \zi\ for genes in S, the GSEA statistic or, our 
recommended choice, the maxmean statistic (3.11). 

3. Standardize S by its randomization mean and standard deviation as in 
(2.14): S' = (S — mean s )/stdev s . For summary statistics such as the mean, 
mean absolute value or maxmean (but not GSEA), this can be computed 
from the genewise means and standard deviations, without having to 
draw random sets of genes. Note formula (3.13) for the maxmean statistic. 

4. Compute permutations of the outcome values (e.g., the class labels in 
the two-class case) and recompute S' on each permuted dataset, yielding 
permutation values S"* 1 , S'* 2 , . . . , S'* B . 

We use these permutation values to estimate p- values for each gene-set score 
S' and false discovery rates applied to these p-values for the collection of 
gene-set scores. 

It would be convenient computationally to pool the permutation values 
for different gene-sets: this is often done in the analysis of individual genes 
[e.g., in the SAM package Tusher et al. (2001)] and also in the GSEA soft- 
ware. Such pooling reduces the number of permutations needed to obtain 
sufficient accurate results. However, we have found that in gene-set analysis 
the permutation distribution for each gene-set depends heavily on both the 
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number of genes in the gene-set and their average pairwise correlation (nei- 
ther of which applies to individual-gene analysis). Hence, we do not pool the 
permutation values, and so we must carry out at least 1000 permutations to 
get accurate results. However, with a careful implementation this need not 
be an obstacle for practical use. 

We have written an R language package GSA ("gene-set analysis") for 
carrying out these computations. This will be freely available along with 
collections of gene-sets for use with the package. 

5. Simulation comparison of different gene-set summaries. In this study 
we simulated 1000 genes and 50 samples in each of 2 classes, control and 
treatment. The genes were assigned to with 50 gene-sets, each with 20 genes. 
All measurements were generated as N(Q, 1) before the treatment effect was 
added. There were five different scenarios: 

1. all 20 genes of gene-set 1 are 0.2 units higher in class 2, 

2. the 1st 15 genes of gene-set 1 are 0.3 units higher in class 2, 

3. the 1st 10 genes of gene-set 1 are 0.4 units higher in class 2, 

4. the 1st 5 genes are 0.6 units higher in class 2, 

5. the 1st 10 genes of gene-set 1 are 0.4 units higher in class 2, and 2nd 10 
genes of gene-set 1 are 0.4 units lower in class 2. 

In every one of these scenarios only the first gene-set is of potential inter- 
est. For each scenario, we carried out 200 permutations and report the es- 
timated tail probability Prob(S"* > S[), with small values being good. Here 
S' is the restandardized version of a summary statistic, and the quantity 
Prob(S"* > S[) is the observed p-value for the first gene-set. 

We compared five different summary statistics: the mean of Zi, the mean 
of \zi\, maxmean, GSEA and GSEA applied to \zi\. Everything was repeated 
20 times, and the mean and standard error of the tail probability over the 
20 simulations are reported in Table 3. While the maxmean can be beaten 
by a small margin by the mean or the mean of the absolute values in the 
one-sided or two-sided scenarios, respectively, it is the only method with 
consistently low p- values in all five scenarios. 

6. Examples. 

6.1. Survival with kidney cancer. In this example we apply the gene-set 
methodology to a dataset with censored survival outcome. Zhao et al. (2005) 
collected gene expression data on 14,814 genes from 177 kidney patients. 
Survival times were also measured for each patient. The data were split into 
88 samples to form the training set and the remaining 89 formed the test 
set. 
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Table 3 

Results of simulation study: p-values for the first gene-set, for five different summary 
statistics (columns) and five different scenarios (rows, described in the text). The first 
four scenarios are one-sided shifts, while the last one is two-sided. We see that the 
maxmean statistic is the only one with consistently low p-values in all five scenarios 





mean 


mean.abs 


maxmean 


GSEA 


GSEA.abs 


(1) 












mean 


0.014 


0.133 


0.012 


0.032 


0.192 


se 


0.008 


0.038 


0.005 


0.017 


0.060 


(2) 












mean 


0.005 


0.035 


0.002 


0.016 


0.074 


se 


0.003 


0.014 


0.001 


0.008 


0.034 


(3) 












mean 


0.014 


0.032 


0.002 


0.031 


0.057 


se 


0.008 


0.015 


0.001 


0.018 


0.032 


(4) 












mean 


0.074 


0.081 


0.014 


0.069 


0.037 


se 


0.032 


0.050 


0.007 


0.038 


0.014 


(5) 












mean 


0.587 


5e-04 


0.018 


0.233 


0.011 


se 


0.106 


4e-04 


0.008 


0.063 


0.009 



We computed the Cox partial likelihood score statistic z% for each gene, 
and used this as the basis for gene-set analysis using the maxmean statistics 
Sk- Since there are separate training and test sets, we examined the con- 
cordance between the scores Sk for a given gene-set between the training 
and test sets. Denote the training and test set scores by S l k r and S l k e . We 
measured the concordance by setting a cutpoint c and defining a test set 
value with \S l k e \ > c to be a "true positive"; otherwise it is called a "true 
negative." Then we applied the same cutpoint to the training set scores S k r 
and then computed the false positive and false negative rates. We applied 
this same idea both to scores for the individual genes and also to gene-set 
enrichment analysis, based on the Cox scores. The cutpoint c was varied to 
generate the curves of false positive and false negative rates in Figure 8. We 
see that both the maxmean statistic and GSEA statistics have considerably 
lower false positive rates, for a given false negative rate, than the individual 
gene analysis. The maxmean statistic also enjoys an advantage over GSEA, 
which is not surprising given the results from our power studies. In this case 
it seems that a list of significant gene-sets based on the maxmean statistic 
will tend to be more reproducible than a list of significant genes from an 
individual gene analysis. 
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Fig. 8. Kidney cancer microarray data: false positive and false negative rates for three 
methods, using the test set values for each method to define the "true positive" and "true 
negative" for genes or gene-sets. 

6.2. Comparisons of different gene-set collections. The gene-sets used in 
the examples of this paper are the collection of 522 gene pathways devel- 
oped by Subramanian et al. (2005). This is the C2 "functional" collection 
in their Molecular Signature Database. The same authors provide three ad- 
ditional collections: CI (chromosomal location, 24 sets), C3 (motif-based, 
319 sets ) and C4 (chromosomal location, 427 sets). The Stanford Microar- 
ray Database has at least two more collections in their "synethetic gene" 
database that are potentially useful for this purpose: tissue types (80 sets) 
and cellular processes (22 sets). We applied our gene-set analysis technique 
using the maxmean statistic to both the p53 and kidney cancer datasets, 
using each of these six collections of gene-sets. The results are shown in Fig- 
ure 9. In both cases the FDR results in the left tail are shown for gene-sets 
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1*53 data 
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Fig. 9. Gene set analysis applied to the p53 mutation data and kidney cancer data, 
comparing the estimated FDRs over six different collections of gene-sets. In both cases the 
FDR results in the left tail (negative gene-set scores) are shown. The FDRs for the right 
tail (positive gene-set scores) were high for both datasets. 

with negative maxmean statistics. The FDRs for the right tail were high 
for both datasets. We see that different collections exhibit the lowest FDRs 
in the two studies: motif, chromosomes and pathways for the p53 data and 
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correlation for the kidney cancer data. This suggests that, for a given expres- 
sion dataset, systematic "mining" of different gene-set collections could be 
a useful exercise for discovering potentially interested gene-sets for further 
investigation. 
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